Abstract

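This paper presents a new approach to deobfuscating Android APKs, based on probabilistic learning from large codebases (so-called "Big Code").

The core idea is to learn a probabilistic model from thousands of non-obfuscated Android applications and then use that model to deobfuscate new, unseen Android APKs.

The paper focuses on reversing layout obfuscation.
Note: obfuscation techniques are usually grouped into layout obfuscation, control obfuscation, data obfuscation, and preventive obfuscation.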

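Layout Obfuscation

Layout obfuscation was once very popular and is still in use today, although it is not a sophisticated technique. It renames program elements such as classes, packages, and methods, which makes the program's code hard to understand.

Specifically, the paper:

Phrases the layout deobfuscation problem for Android APKs as structured prediction in a probabilistic graphical model.
Instantiates this model with a rich set of features and constraints that capture the Android setting, ensuring both semantic equivalence and high prediction accuracy.
Shows how to leverage powerful inference and learning algorithms so that the probabilistic predictions are both precise and scalable.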

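The authors implemented their approach in a tool called DEGUARD and used it to:

Reverse the layout obfuscation that ProGuard, a very popular layout obfuscation tool, applied to benign, open-source applications.
Predict the third-party libraries imported by benign APKs.
Rename the obfuscated program elements of Android malware.

The experimental results indicate that DEGUARD is effective in practice: it recovers 79.1% of the program element names obfuscated by ProGuard and predicts the imported third-party libraries with 91.3% accuracy. In malware, it also reveals string decoders and classes that handle sensitive data.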

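Introduction

This paper presents a new approach to deobfuscating Android applications based on probabilistic models. The approach learns from the large number of Android programs available in public repositories (referred to as "Big Code") to obtain a powerful probabilistic model that captures the key features of non-obfuscated Android programs. It then uses this model to suggest a (probabilistically) likely deobfuscation for Android applications that have been obfuscated.
Our approach enables a variety of security applications. For example, our system successfully deobfuscates Android APKs that were obfuscated by ProGuard.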

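Focus: Layout Deobfuscation

This paper focuses on reversing the layout obfuscation of Android APKs. While obfuscation in general includes other techniques (for example, changing the data representation or altering the control flow), layout obfuscation remains a core part of almost every obfuscation tool.

In layout obfuscation, the names of program elements that carry important semantic information, such as variable, method, and class names, are replaced with identifiers that carry no semantic meaning. Renaming these elements makes it extremely hard for an analyst to read and understand the code (just think how miserable I felt last semester reading smali code full of a, b, c, aa, ab, ac...), which is also why it is useful in many security scenarios, such as protecting intellectual property (that is, keeping code from being stolen).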

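Benefits and Challenges

Reversing the layout obfuscation of Android APKs has several benefits:

It makes it easier for security analysts to inspect Android apps obfuscated by ProGuard.
It identifies the third-party libraries embedded in an Android APK.
It enables automatic search for specific identifiers in the code.

However, reversing layout obfuscation is a hard problem.
The reason is that once the original names have been removed from a program and replaced with short identifiers, there is little hope of recovering them by inspecting that one program in isolation.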

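Probabilistic Learning from "Big Code"

To address the difficulty of deobfuscating a program when it is considered in isolation, one can turn to a class of statistical tools that has emerged over the past few years: probabilistic models learned from "Big Code", which are then used to provide likely solutions to tasks that are otherwise hard to solve. Examples of such tasks include machine translation between programming languages, statistical code synthesis, and predicting names and types in source code.
Interestingly, thanks to their unique capabilities, some of these probabilistic systems have quickly become popular in the developer community.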

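This Work: Android Deobfuscation via "Big Code"

Motivated by these advances, we present a new approach to reversing Android layout obfuscation by learning from thousands of readily available, non-obfuscated Android apps.

Technically, our approach works by phrasing the problem of predicting the names of layout-obfuscated identifiers as structured prediction with a probabilistic graphical model. Specifically, we use Conditional Random Fields (CRFs), a powerful model that is widely applied in many areas, including computer vision and natural language processing. To our knowledge, this is the first work to use probabilistic graphical models learned from "Big Code" to address a core security problem. Using our approach, we present a tool called DEGUARD that automatically reverses the layout obfuscation applied by ProGuard to Android APKs with high prediction accuracy.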
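To make the idea of structured prediction more concrete, here is a minimal sketch in Java. In a CRF the probability of a joint assignment of names is proportional to the exponential of a weighted sum of features, so the most likely (MAP) assignment is the one with the highest score. The class name `ToyMapInference`, the candidate names, the three relations, and all weights below are made up for illustration and are not taken from the paper. The exhaustive search is a stand-in for whatever inference algorithm DEGUARD actually uses.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ToyMapInference {
    // Hypothetical weights for pairwise features (name, relation, name).
    // In a real system these would be learned from non-obfuscated apps.
    static final Map<String, Double> WEIGHTS = new HashMap<>();
    static {
        WEIGHTS.put("DBHelper|extends|SQLiteOpenHelper", 0.8);
        WEIGHTS.put("MainActivity|extends|SQLiteOpenHelper", 0.1);
        WEIGHTS.put("query|calls|rawQuery", 0.6);
        WEIGHTS.put("onClick|calls|rawQuery", 0.2);
        WEIGHTS.put("DBHelper|declares|query", 0.5);
        WEIGHTS.put("MainActivity|declares|onClick", 0.9);
    }

    // Weight of one feature; unseen features contribute nothing.
    static double weight(String left, String relation, String right) {
        return WEIGHTS.getOrDefault(left + "|" + relation + "|" + right, 0.0);
    }

    // Score of a joint assignment for an obfuscated class and one of its methods.
    // The class extends the known SQLiteOpenHelper, the method calls the known
    // rawQuery, and the class declares the method.
    static double score(String className, String methodName) {
        return weight(className, "extends", "SQLiteOpenHelper")
                + weight(methodName, "calls", "rawQuery")
                + weight(className, "declares", methodName);
    }

    // Brute-force MAP inference: try every joint assignment, keep the best one.
    static String[] mapAssignment(List<String> classNames, List<String> methodNames) {
        String[] best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : classNames) {
            for (String m : methodNames) {
                double s = score(c, m);
                if (s > bestScore) {
                    bestScore = s;
                    best = new String[] {c, m};
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] best = mapAssignment(
                Arrays.asList("DBHelper", "MainActivity"),
                Arrays.asList("query", "onClick"));
        System.out.println(best[0] + "." + best[1]);
    }
}
```

The point of scoring the assignment jointly is that the known neighbours (`SQLiteOpenHelper`, `rawQuery`) pull both unknown names towards a consistent answer, even though `MainActivity` and `onClick` fit each other well on their own. Enumerating all assignments only works for a toy example of this size.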

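Main Contributions

A structured prediction approach for performing probabilistic layout deobfuscation of Android APKs.
A set of features and constraints that cleanly capture key parts of Android applications. Combined, these features and constraints ensure that our probabilistic predictions are highly accurate and preserve the original semantics of the app.
A sophisticated, large-scale probabilistic system called DEGUARD.
An evaluation of DEGUARD on open-source Android apps obfuscated by ProGuard and on malicious Android apps. Our results show that DEGUARD recovers 79.1% of the program elements obfuscated by ProGuard, identifies 91.3% of the imported third-party libraries, and reveals the relevant string decoders and classes in malware.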
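As an illustration of why constraints are needed to preserve semantics, consider one rule that follows directly from Java itself: two fields declared in the same class cannot share a name, so a set of predicted names that violates this cannot be applied. The sketch below checks that single rule. The class `NamingConstraintCheck` and its input format are assumptions made for this example and are not DEGUARD's actual representation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class NamingConstraintCheck {
    // Returns true if, in every class, all predicted field names are distinct.
    // Key: class name; value: predicted names of the fields declared in it.
    static boolean fieldsDistinct(Map<String, List<String>> predictedFieldsByClass) {
        for (List<String> fields : predictedFieldsByClass.values()) {
            if (new HashSet<>(fields).size() != fields.size()) {
                return false; // two fields of one class got the same name
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, List<String>> ok = new HashMap<>();
        ok.put("DBHelper", Arrays.asList("db", "context"));

        Map<String, List<String>> clash = new HashMap<>();
        clash.put("DBHelper", Arrays.asList("db", "db"));

        System.out.println(fieldsDistinct(ok));
        System.out.println(fieldsDistinct(clash));
    }
}
```

A check like this only rejects bad predictions after the fact. A system that builds the constraints into inference would never propose the clashing assignment in the first place.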

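Overview

In this section we give an informal overview of our statistical deobfuscation approach for Android. First, we discuss ProGuard, a widely used obfuscation tool for Android apps. We then present the key steps of the DEGUARD system. The goal here is to give an intuitive understanding of the approach. The full formal details are discussed in the following sections.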

ProGuard

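ProGuard obfuscates program elements, including field, method, class, and package names, by replacing the original names with semantically obscure strings. It also removes unused classes, fields, and methods to minimize the size of the output APK. ProGuard processes both the app and the third-party libraries the app imports, so the imported libraries end up hidden in the released APK.

ProGuard cannot obfuscate every program element, because doing so would change the program's semantics. Examples are the names of Android API methods and the names of classes referenced from static files: renaming them would cause broken references and similar problems, and the program would no longer run correctly.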
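The renaming behaviour described above can be imitated with a few lines of Java. This is a simplified sketch, not ProGuard's real algorithm: the class `ToyLayoutObfuscator`, the `KEEP` set, and the single global name counter are assumptions made for illustration. It hands out short names (a, b, ..., z, aa, ab, ...) to every element except those that must keep their names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ToyLayoutObfuscator {
    // Hypothetical keep list: Android API names that must not be renamed.
    static final Set<String> KEEP = new HashSet<>(Arrays.asList(
            "SQLiteOpenHelper", "getWritableDatabase", "rawQuery"));

    // Maps 0, 1, ..., 25, 26, 27 to "a", "b", ..., "z", "aa", "ab".
    static String shortName(int index) {
        StringBuilder sb = new StringBuilder();
        int n = index;
        do {
            sb.append((char) ('a' + n % 26));
            n = n / 26 - 1;
        } while (n >= 0);
        return sb.reverse().toString();
    }

    // Renames every element that is not in KEEP, in the order given.
    static Map<String, String> rename(List<String> names) {
        Map<String, String> mapping = new LinkedHashMap<>();
        int next = 0;
        for (String name : names) {
            if (KEEP.contains(name)) {
                mapping.put(name, name);              // name is preserved
            } else if (!mapping.containsKey(name)) {
                mapping.put(name, shortName(next++)); // semantic name is lost
            }
        }
        return mapping;
    }

    public static void main(String[] args) {
        System.out.println(rename(Arrays.asList(
                "DBHelper", "SQLiteOpenHelper", "execQuery", "rawQuery")));
    }
}
```

In this run `DBHelper` and `execQuery` are replaced by `a` and `b`, while the two API names stay as they are. Those surviving names are exactly the kind of context a deobfuscator can use to guess what the renamed elements were.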

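Tools such as apktool, Dex2Jar, and Java Decompiler make it easy to obtain the code of an Android app. But when the names in that code turn into something like:

    a obj = new a();
    obj.c(str);

it becomes very hard to work out what the code is meant to do, and there is a lot of code like this... So you end up reading the smali code, and after reading enough of it you get used to it.
It is worth noting that, in order to preserve the program's semantics, ProGuard leaves some names unchanged, for example SQLiteOpenHelper and its methods getWritableDatabase and rawQuery, which are all part of the core Android API.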

DeGuard

todo

Dependency Graph

todo

Syntactic and Semantic Constraints

todo

Probabilistic Prediction

todo

Security Applications

todo

Challenges

todo

Scope and Limitations

todo

Background

todo

Problem Statement

todo

Dependency Graph

todo

Features and Weights

todo

Conditional Random Fields

todo

Prediction via MAP Inference

todo

MAP Inference Example

todo

Learning from “Big Code”

todo

Feature Functions

todo

Program Elements

todo

Known and Unknown Program Elements

todo

Grouping Method Nodes

todo

Relationships

todo

Method Relationships

todo

Structural Relationships

todo

Comparison to Other Prediction Systems

todo

Pairwise Feature Functions

todo

Constraints

todo

Naming Constraints for Methods

todo

Example

todo

Expressing Method Constraints

todo

Deriving Inequality Constraints for Methods

todo

Result On the Example

todo

Naming Constraints for Fields, Classes, and Packages

todo

Implementation and Evaluation

todo

The DEGUARD System

todo

Feature Functions and Weights

todo

MAP Inference

todo

Experimental Evaluation

todo

ProGuard Experiments

todo

ProGuard-Obfuscated APKs

Task1: Predicting Program Element Names.

Task2: Predicting Third-party Libraries.

Prediction Speed

Summary of ProGuard Experiments

Experiments with Malware Samples

Revealing Base64 String Decoders

Revealing Sensitive Data Usage

Limitations

Probabilistic models for programs

Conclusion
